Combinatorial protein design 

Jeffery G Saven 

Combinatorial protein libraries permit the examination of a wide 
range of sequences. Such methods are being used for de novo 
design and to investigate the determinants of protein folding. 
The exponentially large number of possible sequences, however, 
necessitates restrictions on the diversity of sequences in a 
combinatorial library. Recently, progress has been made in 
developing theoretical tools to bias and characterize the 
ensemble of sequences that fold into a given structure - tools 
that can be applied to the design and interpretation of 
combinatorial experiments. 

Addresses 

Department of Chemistry, University of Pennsylvania, 231 South 34th
Street, Philadelphia, Pennsylvania 19104, USA;
e-mail: saven@sas.upenn.edu 

Current Opinion in Structural Biology 2002, 12:453-458 

0959-440X/02/$ - see front matter 

© 2002 Elsevier Science Ltd. All rights reserved. 

Introduction 

The discovery and design of novel proteins can lead to 
new, potentially practical proteins and can also enhance 
our understanding of protein biochemistry. Designing 
well-structured, soluble proteins is difficult, however, 
because of their complexity. Such proteins are large (tens 
to hundreds of amino acid residues) and have many variables 
that specify the folded state, including sequence, backbone 
topology and sidechain conformation. Design involves 
identifying those sequences that fold into a given structure 
from a huge ensemble of possible sequences. This search 
is aided, in part, by the large degree of consistency seen in 
folded proteins. On average, a folded structure is well 
packed, hydrophobic residues are sequestered from solvent 
and most potential hydrogen bond interactions are satisfied. 
This consistency, however, is often complex, may have 
little simplifying symmetry and involves predominantly 
noncovalent interactions. Such interactions are some of the 
most difficult to accurately quantify. As such, estimating 
the free energies associated with mutation or structural 
ordering remains a subtle area of computational research. 
Nonetheless, many molecular potentials do contain a 'best 
parameterization' of many of the interatomic interactions 
and forces that we know are important for stabilizing 
proteins. In some cases, such potentials have been used with 
striking success in protein design [1**]. Given that these 
potentials are necessarily approximate, however, one 
promising approach is to use the partial information con- 
tained in these functions in a probabilistic manner. A 
probabilistic or statistical approach is also appropriate for 
characterizing the full variability of sequences that fold 
to a common structure, because there are likely to be an 
enormous number of such sequences. Such statistical 
methods can be applied in 'shotgun' approaches to de novo 
protein design. Combinatorial experiments create and assay 
many sequences in order to overcome shortcomings in
our understanding of folding or other molecular properties. 
Even though combinatorial methods can address large 
numbers of sequences (up to 10^12), these numbers are still
infinitesimal in comparison to the numbers of possible
sequences (e.g. 20^100 ≈ 10^130 for a 100-residue protein). Thus,
methods for winnowing and focusing sequence space are 
a vital component of combinatorial protein design. Herein, 
I briefly discuss combinatorial methods for full sequence 
design. I also review recent theoretical developments 
in characterizing sequence ensembles — developments 
that can be applied to the design and interpretation of 
combinatorial experiments. 
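
The size mismatch quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function name and the choice of working in log10 (to avoid building very large integers) are illustrative assumptions, not part of the original text.

```python
import math

def log10_sequence_count(n_types: int, n_residues: int) -> float:
    """Return log10(n_types ** n_residues) without forming the huge integer."""
    return n_residues * math.log10(n_types)

# 20 amino acid types at each of 100 positions.
log10_total = log10_sequence_count(20, 100)

# A combinatorial library with 10^12 members (the upper figure quoted in the text).
log10_library = 12.0

# Fraction of sequence space such a library could cover, as a power of ten.
log10_fraction = log10_library - log10_total

print(f"20^100 is about 10^{log10_total:.1f}")
print(f"a 10^12-member library covers about 10^{log10_fraction:.1f} of that space")
```

Even the largest library therefore samples a vanishing fraction of the possible sequences, which is why the space must be focused before an experiment is run.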

Directed protein design 

There has been much effort — and success — in developing 
computational methods for 'directed' protein design. By 
'directed protein design', I mean the identification of a 
sequence (or a small set of sequences) that is likely to 
fold into a predetermined backbone structure. Each such 
sequence can then be synthesized to confirm its folded 
structure and other molecular properties. Early efforts in 
design identified proteins with substantial order, but not 
necessarily well-defined tertiary structure [2]. Because an 
enormous number of sequences are possible even for 
small proteins (<50 residues), computational methods 
have dramatically accelerated successful design. Typically, 
such methods are implemented as an optimization 
process, whereby amino acid identity and sidechain 
conformation are varied in order to optimize a scoring 
function that quantifies sequence/structure compatibility. 
Exhaustive searching of all m^N possible sequences (where
m is the number of different amino acid types or 'states'
per residue and N is the number of residues in a target
protein structure) is feasible only if a small number of
residues N are allowed to vary or if the number of amino
acids m is greatly reduced. If, in the optimization process, the
different sidechain conformations (rotamer states) of each
amino acid are also considered (see [3]), the complexity of
the search increases still further, because m, the number of
possible 'states' per residue, increases by a factor of ten 
or more. Although complete enumeration is typically not 
feasible, sequence space can be sampled in a directed 
manner in order to find optimal (or nearly optimal) 
sequences. Stochastic methods, such as genetic algorithms 
or simulated annealing, involve searching sequence space 
in a partially random fashion; on average, the search 
progressively moves toward better scoring (lower energy) 
sequences [4,5]. The partially random nature of the search 
permits escape from local minima in the sequence/rotamer 
landscape. Using a simplified model, the Takada and Tamura 
groups have included information about unfolded structures 
(negative design) in a stochastic search for a sequence with 
a 'funneled conformational energy landscape' [6]. One 
47-residue three-helix bundle protein so selected has
CD and NMR spectral features of folded proteins (W Jin, 
O Kambara, H Sasakawa, A Tamura, S Takada, personal 
communication). When applied to atomically detailed 
representations, the stochastic methods focus primarily 
on repacking the interior of a structure with hydrophobic 
residues [7] and have been applied to the wild-type 
structures of 434 Cro [8], ubiquitin [9], the B1 domain of
protein G [10*], the WW domain [1**] and helical bundles
[11,12]. Although, in many cases, these methods have
identified experimentally viable sequences [1**,13], stochastic
search methods need not identify global optima [14*].
For potentials comprising only site and pair interactions, 
elimination methods such as 'dead end elimination' can find 
the global optimum [14*,15-17]. Such methods successively 
remove individual amino acid rotamer states that cannot be 
part of the global optimum until no further states can be 
eliminated. The Mayo group applied such methods to 
automate the full sequence design of both a 28-residue 
zinc finger mimic [18] and, after predetermining hydro- 
phobic and polar sites, a 51-residue homeodomain motif 
[19*]. The group has also redesigned portions of a variety 
of proteins [20-22]. Functional properties such as metal 
binding or catalysis may also be included as elements of 
the design process [23,24*]. The elements and algorithms 
of directed protein design have been the subject of several 
recent reviews [1**,25,26*]. 
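
The elimination idea described above can be illustrated on a toy problem. The sketch below assumes a potential with only site and pair terms, the same number of states at every position, and randomly generated (hypothetical) energies. It uses the Goldstein form of the elimination criterion, which may differ in detail from the variants used in the cited work, and it compares the surviving states with a brute-force optimum.

```python
import itertools
import numpy as np

def total_energy(states, e_site, e_pair):
    """Site terms plus pair terms (each pair i < j counted once)."""
    n = len(states)
    energy = sum(e_site[i, states[i]] for i in range(n))
    energy += sum(e_pair[i, j, states[i], states[j]]
                  for i in range(n) for j in range(i + 1, n))
    return float(energy)

def dead_end_elimination(e_site, e_pair):
    """Return, per position, the set of states that survive elimination."""
    n, m = e_site.shape
    alive = [set(range(m)) for _ in range(n)]
    changed = True
    while changed:                      # repeat until no state can be removed
        changed = False
        for i in range(n):
            for r in list(alive[i]):
                for t in alive[i] - {r}:
                    # State r is dead if switching to t lowers the energy
                    # even in the environment most favourable to r.
                    gap = e_site[i, r] - e_site[i, t]
                    for j in range(n):
                        if j != i:
                            gap += min(e_pair[i, j, r, s] - e_pair[i, j, t, s]
                                       for s in alive[j])
                    if gap > 0:
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

def random_problem(n=5, m=4, pair_scale=0.2, seed=0):
    """Hypothetical energies: n positions, m states each, symmetric pair terms."""
    rng = np.random.default_rng(seed)
    e_site = rng.normal(size=(n, m))
    e_pair = np.zeros((n, n, m, m))
    for i in range(n):
        for j in range(i + 1, n):
            block = pair_scale * rng.normal(size=(m, m))
            e_pair[i, j] = block
            e_pair[j, i] = block.T
    return e_site, e_pair

e_site, e_pair = random_problem()
alive = dead_end_elimination(e_site, e_pair)
best = min(itertools.product(range(4), repeat=5),
           key=lambda s: total_energy(s, e_site, e_pair))
print("surviving states per position:", [sorted(a) for a in alive])
print("brute-force optimum:", best)
```

Because a state is removed only when another state is provably better in every surviving environment, the brute-force optimum is always among the survivors; how many states are pruned depends on the relative size of the pair terms.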

Despite some striking successes, computational methods 
for directed design have limitations with respect to both 
identifying folding sequences and characterizing the 
features of protein sequences that share a common structure. 
Stochastic methods, such as simulated annealing or genetic 
algorithms, can be applied to large proteins and permit 
many sites to be varied simultaneously, but the compu- 
tational times and resources required for such calculations 
are extensive, even for small proteins. When used as 
optimization methods, directed approaches will necessarily 
be sensitive to the energy or scoring function used. All 
energy functions in use in protein design, however, are 
necessarily approximate and uncertainties in the energy 
function may not merit the search for global optima. 
Furthermore, many naturally occurring proteins are not 
optimized. In fact, most proteins are only marginally stable 
(e.g. ΔG° < 10 kcal/mol for folding) [27]. In addition,
sequences that function, for example, those that bind another 
molecule, need not be the global optimum with respect to 
structural stability. Although stochastic methods can sample 
such suboptimal sequences, in general an exponentially 
large number of them will be possible and such sampling 
will be time consuming. Thus, it is important to develop 
methods complementary to those used for directed protein 
design — methods that reveal the features of sequences 
that are likely to fold into a particular structure but that may 
not be structurally 'optimal'. Such computational methods 
will have application to a new class of protein design studies, 
combinatorial experiments, in which large numbers of 
proteins may be simultaneously synthesized and screened. 
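
As an illustration of sampling suboptimal sequences rather than optimizing, the sketch below runs a Metropolis Monte Carlo walk over sequences at a fixed temperature and accumulates per-site amino acid frequencies. The four-letter alphabet, the burial pattern and the scoring function are all hypothetical stand-ins for a real sequence/structure energy function.

```python
import math
import random
from collections import Counter

ALPHABET = "LVKE"          # hypothetical reduced alphabet
HYDROPHOBIC = set("LV")

def toy_energy(seq, buried):
    """Hypothetical score: -1 if a site's polarity matches its burial, else +1."""
    return sum(-1.0 if ((aa in HYDROPHOBIC) == is_buried) else 1.0
               for aa, is_buried in zip(seq, buried))

def sample_sequences(buried, temperature=1.0, n_steps=20000, seed=0):
    """Metropolis sampling; returns per-site amino acid frequencies."""
    rng = random.Random(seed)
    seq = [rng.choice(ALPHABET) for _ in buried]
    energy = toy_energy(seq, buried)
    counts = [Counter() for _ in buried]
    for _ in range(n_steps):
        site = rng.randrange(len(seq))
        old = seq[site]
        seq[site] = rng.choice(ALPHABET)          # symmetric proposal
        new_energy = toy_energy(seq, buried)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy                   # accept the mutation
        else:
            seq[site] = old                       # reject and restore
        for i, aa in enumerate(seq):              # record the current sequence
            counts[i][aa] += 1
    return [{aa: c[aa] / n_steps for aa in ALPHABET} for c in counts]

buried = [True, False, True, True, False, False]
probs = sample_sequences(buried)
for i, p in enumerate(probs):
    print(i, "buried" if buried[i] else "exposed",
          {aa: round(x, 2) for aa, x in p.items()})
```

Lowering the temperature concentrates the sample on the best-scoring sequences, whereas raising it broadens the ensemble; for realistic energy functions and protein sizes, converging such frequencies is the time-consuming step noted in the text.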



Combinatorial design 

Combinatorial design provides a complementary approach 
to directed design for understanding sequence/structure
compatibility and discovering novel sequences that fold 
into a specific structure. Combinatorial methods are 
powerful tools for cases in which we have an incomplete 
understanding of molecular properties. In protein combi- 
natorial design experiments, large numbers of sequences 
(libraries) are screened for evidence of folding into a 
predetermined structure. A combinatorial experiment has 
two key elements: creating a library with a desired degree 
of diversity and assaying for sequences with 'protein-like' 
properties in terms of their structure or function. Depending 
upon how the diversity is generated and assayed, experi- 
ments of this type can explore a large number of sequences, 
up to 10^12 [28*]. Certainly, such methods can be used to
discover 'hits', that is, a few sequences that are especially 
stable or that are unusually strong in their function or 
binding properties. In addition, combinatorial experiments 
readily generate a sequence ensemble. Thus, using combi- 
natorial experiments, we can potentially 'expand the protein 
sequence database' and the diversity of these additional 
sequences will be at the control of the researcher. Features 
important to folding (and other properties) may be explored 
in a way that is decoupled from the evolutionary require- 
ments of nature's proteins. For example, these methods 
have been used to identify helical proteins [29-31], 
ubiquitin variants [32], self-assembled protein monolayers 
[33], proteins with amyloid-like properties [33], metal- 
binding peptides [34] and stable interhelical oligomers [35]. 
Several excellent reviews of combinatorial experiments have 
appeared recently [36,37,38*,39**].

The complexity of combinatorial experiments implies that 
limitations must be placed on the sequences, because the 
number that can be created and screened (10^6-10^12) is
infinitesimal compared to the number possible (e.g. 10^130).
Limitations on sequence properties are often guided by 
qualitative chemical considerations, but quantitative 
computational methods will be helpful in designing and 
interpreting combinatorial experiments. 

The Hecht group has probed the extent to which the 
patterning of hydrophobic and hydrophilic residues can 
successfully reduce complexity in combinatorial design. 
While maintaining the periodicity of α helices and β sheets
in particular tertiary structures, such patterning is applied 
in order to expose hydrophilic residues to solvent and to 
sequester hydrophobic residues in the interior of the 
protein. Early targets were helical proteins; a fiducial 
74-residue four-helix bundle was the template structure [40]. 
Such a structure has more than 20^74 ≈ 10^96 possible sequences.
After binary patterning, five hydrophobic and six 
hydrophilic amino acids were permitted at 24 interior 
and 36 exterior positions, respectively, thus reducing the 
total number of possible sequences to 10^41. From a protein
library consistent with this binary patterning, a set of
50 correctly expressed sequences was selected for further 
study. Around half of the 50 sequences isolated are protein-like
in many respects [30], including their thermal
denaturation [41]. About half the isolated sequences
also bind heme [29] and many of these display carbon
monoxide binding [42*] or peroxidase activity [43]. This is
surprising given that such functions were not part of the
design or selection of the sequences. In a second-generation 
design, the group added six residues to each of the four 
helices of one of the most protein-like sequences. The 
additional residues were combinatorially patterned, as 
in the original experiment [39**]. For these 102-residue
sequences, the free energies of folding are increased
2-3-fold and the NMR data suggest well-determined
structures. Using binary patterning of hydrophobicity
consistent with an amphiphilic β sheet [44], the Hecht group
has also identified proteins that aggregate to form amyloid
fibrils [45] and crafted monomeric β proteins by introducing
a nonpolar-to-lysine mutation at the 'edge' strand of the
target β sheet [46**].
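
Binary patterning is straightforward to express in code. The sketch below counts and samples the members of a patterned library. The 14-residue pattern is a hypothetical example, not the published template, and the two residue sets are assumptions chosen only to match the 'five hydrophobic and six hydrophilic' description.

```python
import random

HYDROPHOBIC = "FLIMV"     # five nonpolar residues (assumed set)
HYDROPHILIC = "EDKNQH"    # six polar residues (assumed set)
CHOICES = {"N": HYDROPHOBIC, "P": HYDROPHILIC}

# Hypothetical pattern: 'N' marks a nonpolar (interior) site, 'P' a polar one.
PATTERN = "PNPPNNPPNPPNNP"

def library_size(pattern: str) -> int:
    """Number of distinct sequences consistent with the binary pattern."""
    return (len(HYDROPHOBIC) ** pattern.count("N")
            * len(HYDROPHILIC) ** pattern.count("P"))

def random_member(pattern: str, rng: random.Random) -> str:
    """Draw one sequence uniformly from the patterned library."""
    return "".join(rng.choice(CHOICES[c]) for c in pattern)

rng = random.Random(0)
member = random_member(PATTERN, rng)
print("unrestricted:", 20 ** len(PATTERN))
print("patterned:   ", library_size(PATTERN))
print("one member:  ", member)
```

Patterning alone shrinks the library by many orders of magnitude while leaving the identity at each site only loosely specified, which is the trade-off discussed in the following paragraph.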

Despite the striking results from hydrophobic patterning, 
more detailed methods for library design are merited. Many 
of the hydrophobically patterned sequences that appear 
well structured are not sufficiently soluble for NMR 
structure determination [46**] and, as a result, little is
known concerning their structures at the atomic scale. Not 
all of the α-helical sequences exhibit the sharp thermal
transition seen in natural proteins (usually associated with a
large ΔH of folding). Such sequences may not possess
well-packed interiors [41]. In natural proteins, the side- 
chains of most interior residues are well determined, as 
opposed to the variability that is obtained using hydrophobic 
patterning alone and that is observed in many de novo 
designed proteins [13,18]. A more fine-grained dictation of 
the amino acid identities is probably necessary for obtaining 
libraries that are rich in sequences with well-defined struc- 
tures. Moreover, a more detailed specification of amino acid 
identities yields fewer sequences than hydrophobic pattern- 
ing alone and further reduces the complexity of the library. 

Theories of combinatorial libraries 

Surveying the complete sequence landscape of proteins 
seems, at first glance, intractable to both experiment and 
computation. In addition to the enormous number of 
possible sequences, many examples exist in nature of dis- 
similar sequences folding to essentially the same structure. 
Hence, sequence properties are nontrivial and proteins 
sharing a common structure can be nonlocal in sequence 
space. Nonetheless, computational methods permit us to 
estimate the properties, particularly the amino acid proba- 
bilities, of sequences consistent with a target structure. 

Repeated use of directed search methods can estimate the 
properties of an ensemble of sequences. Desjarlais and 
co-workers have used independent runs of their sequence 
prediction algorithm across an ensemble of closely 
related structures all consistent with a particular fold 
(JR Desjarlais et al., personal communication). For each
structure, an optimal 'nucleating' sequence is identified
and subsequently the sequence/rotamer variability is 
explored throughout the structure. The method identifies 
effective reduced partition sums for each sequence/rotamer 
state and amino acid probabilities may be obtained at each 
residue position. The number of sequences decreases 
with stability, so the degree of complexity can be tuned by 
varying a cutoff in the effective free energies of the 
sequences. The method has been used to identify sequences 
consistent with the fold of a WW domain, a small β-sheet
protein [1**], some of which are currently being experi- 
mentally characterized. 

The amino acid frequencies can also be determined 
directly, using a statistical theory of combinatorial libraries 
[47,48**,49**]. Ideas from statistical mechanics are used to
address the number and composition of sequences that are 
consistent with a particular backbone structure. The theory 
addresses the whole space of available compositions, not 
just the small fraction that is accessible to experiment and 
to computational enumeration and sampling. The theory 
takes as input a target backbone structure and a scoring 
or energy function for quantifying sequence/structure 
compatibility. Global and local features can be prespecified 
using constraints on the sequences. For example, such 
constraints can be used to determine the energy the 
sequences assume in the target structure, the patterning of 
amino acids and the number of each amino acid present 
(composition). The theory yields estimates of both the
number of sequences consistent with these constraints and 
the amino acid probabilities at each residue position. 
These residue-specific probabilities are the most probable 
such set and are determined — as in statistical mechanics 
— by maximizing an effective entropy, whereby this 
maximization is subject to constraints. Just as in thermo- 
dynamics, the judicious use of constraints can be used to 
reduce the entropy or the number of possible sequences. 
Thus, these methods provide a systematic means to focus 
the library, winnowing numbers such as 10^130 to numbers
that are experimentally manageable, for example, 10^6. The
theory agrees well with exact results obtained with lattice
models of proteins [47,48**]. This method has been
extended to realistic representations of proteins, in
which the effects of sidechain packing are included in
an atom-based manner [49**]. The calculated sequence
probabilities of the immunoglobulin light chain binding 
domain of protein L are in agreement with the frequencies 
observed in combinatorial phage display experiments [50,51]. 
These statistical methods have several advantages. They 
may be applied to much larger proteins (N > 100 residues)
and permit much larger sequence variation than many 
directed methods. They are sufficiently rapid that many 
backbone structures may be considered and those features 
that are robust with respect to minor structure modifications 
may be identified. Importantly, such methods provide 
perhaps the most natural input for a combinatorial exper- 
iment, the probabilities of the amino acids at each position 
among the sequences of a library. These amino acid 
probabilities can also be used to identify specific amino
acid sequences, which can then be synthesized; a consensus 
sequence comprising the most probable amino acid at each 
site can be selected or the probabilities can be used to bias 
a stochastic search for viable sequences (J Zou, JG Saven, 
unpublished data). 
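
The constrained entropy maximization can be made concrete in a deliberately simplified setting. The sketch below assumes a scoring function with site (one-body) terms only, which is a much cruder model than the cited theory, and uses random hypothetical energies. Under that assumption, maximizing the entropy subject to a fixed mean energy gives site probabilities proportional to exp(-beta * energy), where beta is the Lagrange multiplier of the energy constraint; the code solves for beta, estimates the number of sequences as exp(entropy) and reads off a consensus sequence.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def site_probabilities(eps, beta):
    """w[i, a] proportional to exp(-beta * eps[i, a]), normalized at each site."""
    x = -beta * eps
    x -= x.max(axis=1, keepdims=True)          # stabilize the exponentials
    w = np.exp(x)
    return w / w.sum(axis=1, keepdims=True)

def mean_energy(eps, beta):
    """Ensemble-averaged energy for a given multiplier beta."""
    return float((site_probabilities(eps, beta) * eps).sum())

def solve_beta(eps, target, lo=0.0, hi=50.0):
    """Bisection: the mean energy decreases monotonically as beta grows."""
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if mean_energy(eps, mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def log10_sequence_count(w):
    """Entropy estimate of library size: count ~ exp(S), S = -sum w ln w."""
    safe = np.where(w > 0, w, 1.0)             # treat 0 * ln 0 as 0
    return float(-(w * np.log(safe)).sum() / np.log(10))

rng = np.random.default_rng(1)
eps = rng.normal(size=(100, 20))               # hypothetical energies, 100 sites
e_random = mean_energy(eps, 0.0)               # unconstrained library
e_best = float(eps.min(axis=1).sum())          # lowest attainable energy
target = e_random + 0.5 * (e_best - e_random)  # constrain halfway between them

beta = solve_beta(eps, target)
w = site_probabilities(eps, beta)
consensus = "".join(AMINO_ACIDS[a] for a in w.argmax(axis=1))
print("log10(sequences), unconstrained:", log10_sequence_count(site_probabilities(eps, 0.0)))
print("log10(sequences), constrained:  ", log10_sequence_count(w))
print("consensus sequence:", consensus)
```

Tightening the energy constraint lowers the entropy and hence the estimated library size, which is the winnowing mechanism described above; with pair interactions the site probabilities become coupled and must be solved self-consistently rather than in closed form.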

If the energy of the target state is one of the constraints, 
the statistical method reduces to an effective mean field 
theory. Mean field theories have seen extensive application 
in physical science and in biomolecular theory [52], and 
to protein evolution and natural sequence variability ([53]; 
H Kono, JG Saven, unpublished data). Voigt et al. [14*] have
compared mean field theories with directed search 
methods for identifying ground state sequence/rotamer 
combinations in protein design. They found that, although 
often more rapid, mean field theories do not always identify 
such ground states. Interestingly, Voigt et al. applied the mean
field theory to large proteins (subtilisin E and T4 lysozyme)
to determine local site entropies, s_i, where exp(s_i) quantifies
the effective number of amino acids allowed at residue i
in a structure [54**,55]. Sites with large values of s_i, those
most tolerant to mutation [56], are likely to support sub- 
stitutions that improve stability or function when in vitro 
evolution experiments are used to explore sequence 
space [37]. For such experiments, the mutation rate is low 
enough that multiple mutations of strongly interacting sites 
are rare. Thus, mutations that improve 'fitness' are most 
likely to accumulate at sites that are the most 'decoupled' 
from other sites. Such mutations can potentially be targeted 
for variation in an in vitro evolution experiment. 
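
Given per-site amino acid probabilities, the site entropy and the effective number of allowed residues follow directly. The sketch below assumes the standard Shannon form, s_i = -sum over a of w_i(a) ln w_i(a), and uses a small hypothetical probability table over a four-letter alphabet.

```python
import numpy as np

def site_entropies(w):
    """s_i = -sum_a w[i, a] * ln w[i, a]; zero-probability terms contribute 0."""
    safe = np.where(w > 0, w, 1.0)
    return -(w * np.log(safe)).sum(axis=1)

# Hypothetical probabilities for three sites.
w = np.array([
    [0.25, 0.25, 0.25, 0.25],   # fully tolerant site
    [0.97, 0.01, 0.01, 0.01],   # nearly invariant site
    [0.50, 0.50, 0.00, 0.00],   # two residues equally allowed
])

s = site_entropies(w)
effective = np.exp(s)           # effective number of residues per site
ranking = np.argsort(-s)        # most mutation-tolerant sites first

for i in ranking:
    print(f"site {i}: s = {s[i]:.3f}, effective residues = {effective[i]:.2f}")
```

Sites at the top of such a ranking would be the natural candidates to randomize in an in vitro evolution experiment.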

Conclusions 

Much recent progress has been seen in the design and 
discovery of new proteins, and combinatorial approaches 
are accelerating the pace. Such methods are most useful 
when our quantitative understanding of important protein 
properties, such as stability and catalytic activity, is limited. 
Not only can combinatorial methods be used for discovery 
but also, more deeply, they can inform our understanding 
of protein properties by generating and assaying whole 
ensembles of sequences. Traditionally, advances in 
structural biology have come from examining the structures 
of naturally occurring proteins, but, with combinatorial 
experiments, an enormous diversity of sequences can be 
generated at the control of the researcher. Detailed ques- 
tions can be addressed, such as the utility of hydrophobic 
patterning or of predetermining particular sites for amino 
acid variation. Theory and simulation will continue to aid 
the design and interpretation of combinatorial experiments. 
Such methods will also facilitate the exploration of what is 
possible with the amino acids: how diverse is the set of all 
possible sequences that fold to a particular structure and 
what structures not yet seen in nature can be crafted with 
the amino acids? Such methods will perhaps have an even 
more profound impact on designing nonbiological 
foldamers [57**], structures about which we have much less
empirical information than we do about biopolymers. 